Predicting Bank Customer Churn

Predictive Analytics

Overview

The goal of this project is to predict customer churn at a bank using machine learning techniques. The project covers feature engineering, model specification, training, and evaluation to identify the best-performing model for predicting churn.

Show the code
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
library(tidyverse)
library(caret)
library(pROC)
library(MLmetrics)
library(fastDummies)
bank = read_rds("/Users/Shared/Data 505/BankChurners.rds")
Show the code
# Engineer a quadratic age term and keep a small three-predictor feature set
banko <- bank %>%
  mutate(age2 = Customer_Age^2) %>%
  select(Customer_Age, age2, Dependent_count, Churn)

# Re-read the data, recode Churn as logical, and dummy-encode the categorical variables
bank = read_rds("/Users/Shared/Data 505/BankChurners.rds") %>%
  mutate(Churn = Churn == "yes") %>%
  dummy_cols(remove_selected_columns = TRUE)

# Principal component analysis on the centered and scaled predictors
pr_bank = prcomp(select(bank, -Churn), scale = TRUE, center = TRUE)

screeplot(pr_bank, type = "lines")
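The scree plot gives a visual elbow; the cumulative proportion of variance explained can also be read off numerically from the standard deviations that prcomp stores (a quick check, not part of the original output):

# Cumulative proportion of variance explained by each component
round(cumsum(pr_bank$sdev^2) / sum(pr_bank$sdev^2), 3)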

Show the code
# Keep Churn plus the first four principal components; the names below are
# interpretive labels for the PCs, not the original columns
prc <- bind_cols(select(bank, Churn), as.data.frame(pr_bank$x)) %>%
  select(1:5) %>%
  rename("Gender" = PC1, "Card_Category" = PC2, "Income_Category" = PC3, "Credit_Limit" = PC4)

head(prc)
# A tibble: 6 × 5
  Churn Gender Card_Category Income_Category Credit_Limit
  <lgl>  <dbl>         <dbl>           <dbl>        <dbl>
1 FALSE  1.50          2.38            1.21         0.897
2 FALSE -1.36         -0.653           1.52         1.46 
3 FALSE  0.943         2.25            2.38         2.29 
4 FALSE -2.50         -0.208           2.35         1.39 
5 FALSE  0.841         2.14            3.82         0.559
6 FALSE -0.115         2.22            0.918        0.721
Show the code
# 3-fold cross-validation, keeping class probabilities for ROC-based tuning
ctrl <- trainControl(method = "cv", number = 3, classProbs = TRUE, summaryFunction = twoClassSummary)
set.seed(504)

# 80/20 train/test split, stratified on Churn
bank_index <- createDataPartition(banko$Churn, p = 0.80, list = FALSE)
train <- banko[bank_index, ]
test <- banko[-bank_index, ]

# Train Random Forest model
fit <- train(Churn ~ .,
             data = train,
             method = "rf",
             ntree = 20,
             tuneLength = 3,
             metric = "ROC",
             trControl = ctrl)
note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .
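The truncation note appears because tuneLength = 3 requests three mtry values, but with only three predictors caret's default grid contains just two distinct candidates. One way to avoid the note, sketched here as an alternative rather than the original specification, is to pass an explicit grid:

# Sketch: enumerate the mtry candidates explicitly instead of using tuneLength
fit <- train(Churn ~ .,
             data = train,
             method = "rf",
             ntree = 20,
             tuneGrid = expand.grid(mtry = 1:3),
             metric = "ROC",
             trControl = ctrl)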
Show the code
fit
Random Forest 

8102 samples
   3 predictor
   2 classes: 'no', 'yes' 

No pre-processing
Resampling: Cross-Validated (3 fold) 
Summary of sample sizes: 5401, 5402, 5401 
Resampling results across tuning parameters:

  mtry  ROC        Sens       Spec        
  2     0.4945632  0.9995588  0.0000000000
  3     0.4953395  0.9988237  0.0007680492

ROC was used to select the optimal model using the largest value.
The final value used for the model was mtry = 3.
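A cross-validated ROC near 0.495 is already a warning sign that the three features carry little signal. One quick diagnostic, not part of the original pipeline, is caret's scaled variable importance for the tuned forest:

# Scaled variable importance for the fitted random forest
varImp(fit)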
Show the code
confusionMatrix(predict(fit, test), factor(test$Churn))
Confusion Matrix and Statistics

          Reference
Prediction   no  yes
       no  1700  325
       yes    0    0
                                          
               Accuracy : 0.8395          
                 95% CI : (0.8228, 0.8552)
    No Information Rate : 0.8395          
    P-Value [Acc > NIR] : 0.5148          
                                          
                  Kappa : 0               
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 1.0000          
            Specificity : 0.0000          
         Pos Pred Value : 0.8395          
         Neg Pred Value :    NaN          
             Prevalence : 0.8395          
         Detection Rate : 0.8395          
   Detection Prevalence : 1.0000          
      Balanced Accuracy : 0.5000          
                                          
       'Positive' Class : no              
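The confusion matrix shows the model never predicts 'yes': with roughly 16% churners and three weak predictors, the forest defaults to the majority class, which is why accuracy exactly matches the no-information rate. One common mitigation, sketched below using caret's built-in resampling option (a sketch, not the original specification), is to up-sample the minority class within each cross-validation fold:

# Sketch: up-sample the minority class inside each CV fold
ctrl_up <- trainControl(method = "cv", number = 3, classProbs = TRUE,
                        summaryFunction = twoClassSummary, sampling = "up")

fit_up <- train(Churn ~ .,
                data = train,
                method = "rf",
                ntree = 20,
                tuneLength = 3,
                metric = "ROC",
                trControl = ctrl_up)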
                                          
Show the code
print(fit$bestTune)
  mtry
2    3
Show the code
set.seed(1504)

# Fresh 80/20 split for the final fit
bank_index <- createDataPartition(banko$Churn, p = 0.80, list = FALSE)
train <- banko[bank_index, ]
test <- banko[-bank_index, ]

# Re-fit the model using the best mtry from the tuning run
fit_final <- train(Churn ~ .,
                   data = train,
                   method = "rf",
                   tuneGrid = fit$bestTune,
                   metric = "ROC",
                   trControl = ctrl)

# ROC curve on the held-out set, using the predicted probability of the "yes" class (column 2)
myRoc <- roc(test$Churn, predict(fit_final, test, type = "prob")[, 2])

plot(myRoc)

Show the code
auc(myRoc)
Area under the curve: 0.4861
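An AUC below 0.5 means the model's ranking of churners is slightly worse than chance, consistent with the weak feature set. Note also that the principal-component scores assembled in prc were never passed to a model; the forest above saw only the age and dependent-count features. A minimal sketch of fitting the same specification on those components instead (assuming Churn is first recoded to factor levels caret accepts):

# Sketch: train the same random forest on the first four principal components
prc_data <- prc %>%
  mutate(Churn = factor(ifelse(Churn, "yes", "no"), levels = c("no", "yes")))

prc_index <- createDataPartition(prc_data$Churn, p = 0.80, list = FALSE)
fit_pca <- train(Churn ~ .,
                 data = prc_data[prc_index, ],
                 method = "rf",
                 ntree = 20,
                 metric = "ROC",
                 trControl = ctrl)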

Conclusion:

This project walked through the full workflow for predicting bank customer churn: feature engineering, dummy encoding, PCA, and cross-validated Random Forest training. The results themselves are cautionary. Trained on only the age and dependent-count features, the model never predicted a churner (specificity of 0, Kappa of 0), its accuracy merely matched the no-information rate, and the held-out AUC of roughly 0.49 was no better than chance. The principal components were constructed but never supplied to the model, so dimensionality reduction did not contribute to predictive power here.

Future Work:

The most immediate improvements are to train on a richer feature set (the full dummy-encoded predictors or the principal components, rather than age and dependent count alone) and to address the class imbalance, for example with the up-sampling sketch shown above. Beyond that, exploring other algorithms, feature selection techniques, and more thorough hyperparameter tuning, along with more granular customer data and external factors, could further improve prediction accuracy.